ABSTRACT

Memory controllers have traditionally used static page closure policies to decide whether a row should be left open (open-page policy) or closed immediately (close-page policy) after it has been accessed. The appropriate choice for a particular access can reduce the average memory latency. However, since application access patterns change at run time, static page policies cannot guarantee optimum execution time. Hybrid page policies have been investigated as a means of covering these dynamic scenarios and are now implemented in state-of-the-art processors. Hybrid page policies switch between open-page and close-page policies while the application is running, by monitoring the pattern of row hits and conflicts and predicting future behavior. Unfortunately, as the size of DRAM memory increases, fine-grain tracking and analysis of memory access patterns becomes impractical.

We propose a compact memory address-based encoding technique which can improve or maintain the performance of DRAM page closure predictors while reducing the hardware overhead in comparison with state-of-the-art techniques. As a case study, we integrate our technique, HAPPY, with a state-of-the-art monitor (the Intel-adaptive open-page policy predictor employed by the Intel Xeon X5650) and with a traditional Hybrid page policy. We evaluate them across 70 memory intensive workload mixes consisting of single-thread and multi-thread applications. The experimental results show that applying the HAPPY encoding to the Intel-adaptive page closure policy can reduce the hardware overhead by 5x for the evaluated 64 GB memory (up to 40x for a 512 GB memory) while maintaining the prediction accuracy.

1. INTRODUCTION 

The performance of DRAM is sensitive to the memory access pattern of the running applications. Traditionally, DRAM controllers have used a static row-buffer access policy, either open-page or close-page, to decide whether a row should be left open or closed immediately after it is accessed. For workloads with high access locality, open-page works best since the target row is already open and multiple accesses to that row can be serviced with one activation. However, for workloads with more random memory accesses, close-page is a better option. In this case a row is closed immediately after a memory access, so the next memory request within the same bank does not need to wait for the open row to be precharged. Neither the open-page nor the close-page policy can deliver the 'best' execution time for all workloads, due to the dynamic nature of memory accesses. In this situation a hybrid-page policy, which is a mixture of open-page and close-page, is more desirable.

Different techniques have been proposed in the literature to select between open-page and close-page in DRAM memory controllers. Access-based techniques monitor and keep a history of the row hits and row misses at different granularities in the DRAM to predict the future page closure policy. Time-based techniques, on the other hand, focus on predicting the optimum time that a row should be left open. In general, these techniques rely on predictors that monitor the number of accesses per row, the number of row hits or row misses, the time between hits or misses, etc. to predict the open-page or close-page policy for each row in DRAM. Intel included two time-based techniques in the Xeon X5650.

As DRAM sizes grow to serve data analytics applications, a fine-grain prediction and monitoring scheme becomes inefficient and does not scale. On the other hand, moving toward coarse-grain schemes reduces the accuracy of the prediction. A key challenge for page closure techniques is to balance the hardware overhead against the prediction accuracy.

The trend towards keeping entire databases in DRAM, such as RAMCloud (e.g. 64 TB of DRAM) or Facebook using 150 TB of DRAM with memcache, turns the scalability issue into a critical problem for future DRAM systems.

Our contribution is a scalable and compact memory address-based encoding technique, called HAPPY, that can be employed in DRAM memory controllers. HAPPY is an efficient encoding that reduces the implementation cost of existing page closure techniques while maintaining the prediction accuracy of the original implementation. As case studies, we show how to integrate HAPPY with a state-of-the-art implementation (the Intel-adaptive open-page policy employed by Intel) and with a traditional hybrid-page policy. We evaluate HAPPY and existing techniques across 70 memory intensive workload mixes consisting of single-thread and multi-thread applications. The experimental results show that applying the HAPPY memory address-based encoding to the Intel-adaptive page policy can reduce the hardware cost of implementation by 5x for the evaluated 64 GB memory while maintaining the prediction accuracy. In other words, we can achieve similar or better performance than existing high-performance industry and academic techniques while requiring less hardware overhead.

2. BACKGROUND AND MOTIVATION 

DRAM Structure: Figure 1 presents a high-level structure of a typical DRAM organization. A DRAM device (Figure 1a) consists of multiple banks, each of which includes a data array and a sense amplifier or row buffer. The data array is a matrix of rows and columns comprising the storage cells. The basic operation of DRAM requires that, to access a specific cell within a bank, the entire row (e.g. 1 KB of data) has to be moved into the row buffer. Then, read or write operations can be performed on the data stored in the row buffer. Although the banks within a DRAM device can be accessed in parallel, they share the communication bus, so only one bank at a time can transfer data out of the DRAM device. Each DRAM device typically supports read/write operations of 4-16 bits per memory request, depending on the DRAM model. To support the required bandwidth, multiple DRAM devices work in parallel within a Rank (Figure 1b). In modern DRAMs, 64 bits of data can be read or written per cycle, and typically a burst of size 4 or 8 is supported by these modules to fill a full cache line.



Figure 1: DRAM Structure. (a) DRAM Device. (b) DRAM Rank.

DRAM Basic Operation: To perform a read or write operation, the target row has to be opened first using an activation command, which transfers the row to the row buffer and imposes a delay tRCD. Once the row is in the row buffer, a read or write command can be issued with a delay tCL. Given the internal structure of DRAM, only one row per bank can be processed at a time. Thus, to access a different row within the same bank, the open row has to be closed first using the precharge command, with a delay tRP. This command prepares the row buffer to accept the new row. Considering the basic operation of DRAM, each memory request can be classified into one of the following three categories depending on the status of the bank to be accessed: page-hit, page-miss or page-empty.

A page-hit is defined as a read/write operation to an open row within a bank. In this situation there is no need for an activation command and the memory request can be serviced immediately. A page-miss is defined as a read/write operation to a row other than the open row within a bank. In this situation the open row must first be closed before the second row can be accessed. Finally, a page-empty is defined as a read/write command to a bank that has no open row in the row buffer. In this case an activation command is required to open the target row. Page-misses are the most expensive memory requests, while page-hits are the cheapest to service. Page-empties are cheaper than page-misses but more expensive than page-hits.
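The three categories and their relative costs can be sketched as follows. This is a minimal illustration rather than the paper's simulator code: the names `Bank`, `classify` and `latency` are hypothetical, and the timing values are assumed placeholders in DRAM cycles.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed, illustrative timing parameters (DRAM cycles); not taken from the paper.
T_RCD, T_CL, T_RP = 11, 11, 11


@dataclass
class Bank:
    # None means the bank is precharged, i.e. no row is open in the row buffer.
    open_row: Optional[int] = None


def classify(bank: Bank, row: int) -> str:
    """Classify a request to `row` according to the state of the bank."""
    if bank.open_row is None:
        return "page-empty"
    return "page-hit" if bank.open_row == row else "page-miss"


def latency(kind: str) -> int:
    """Latency implied by the commands each category needs."""
    if kind == "page-hit":
        return T_CL                      # column access only
    if kind == "page-empty":
        return T_RCD + T_CL              # activate, then column access
    return T_RP + T_RCD + T_CL           # precharge, activate, column access
```

With any positive timing values the ordering page-hit < page-empty < page-miss described above follows directly.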

DRAM Static Page Closure Policies: DRAM memory controllers use a page closure policy to alleviate the effect of page-misses on the memory system's performance. The traditional schemes are the open-page and the close-page policy. A DRAM that uses the open-page policy leaves the last accessed row open in the row buffer to eliminate the activation cost of the next memory request to the same row. A DRAM that uses the close-page policy closes the row immediately after it has been accessed to eliminate the possibility of a page-miss on the next memory request. In general, the open-page policy is more beneficial for systems with high access locality, whereas the close-page policy is more appropriate for systems with high-entropy memory accesses. Table 1 presents the timing cost of page-hits and page-misses when using the static page closure policies.


Page Policy        Page Hit       Page Miss
Open-Page          tCL            tRCD + tCL + tRP
Close-Page         tRCD + tCL     tRCD + tCL
Static Profiling   tCL            tRCD + tCL

Table 1: Cost of Page-hits and Page-misses when
using different page closure policies.
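As a worked example of the costs in Table 1, the sketch below computes the expected latency per request under each static policy for a given fraction of requests that target the previously accessed row. The function names and the sample timing values are assumptions for illustration only.

```python
def open_page_latency(hit_rate: float, t_rcd: float, t_cl: float, t_rp: float) -> float:
    """Expected latency under open-page: hits cost tCL, misses cost tRCD + tCL + tRP."""
    return hit_rate * t_cl + (1.0 - hit_rate) * (t_rcd + t_cl + t_rp)


def close_page_latency(t_rcd: float, t_cl: float) -> float:
    """Under close-page every request pays tRCD + tCL, whatever the locality."""
    return t_rcd + t_cl


def break_even_hit_rate(t_rcd: float, t_rp: float) -> float:
    """Hit rate at which both policies cost the same.

    Setting h*tCL + (1-h)*(tRCD + tCL + tRP) = tRCD + tCL and solving for h
    gives h = tRP / (tRCD + tRP).
    """
    return t_rp / (t_rcd + t_rp)
```

Under this simple model, with tRCD equal to tRP the break-even point is a 50% row-hit rate: above it open-page is cheaper, below it close-page is cheaper.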


Motivation: Figure 2 depicts the execution time of all the workloads used in this paper under the open-page and close-page policies, normalised to the open-page policy. The results show that around 68% of the workloads prefer the open-page policy, while 32% of the workloads perform better with the close-page policy. According to this figure, a memory system that employs the open-page policy can save up to 18% in comparison with the close-page policy when running 'libquantum' from the SPEC benchmark and, at the same time, might lose up to 18% when running 'tigr' from the BIOBENCH benchmark. Therefore, there is almost 40% performance variation in the system depending on the static page policy that a memory controller employs. This motivates the development of dynamic page policies that switch between the open-page and close-page policy at run time based on the application access behavior. The Static Profiling row in Table 1 shows the cost of page-hits and page-misses when the memory controller selects the best static page closure policy for each workload by static profiling of its memory accesses. Static profiling provides a baseline for evaluating the performance of the dynamic page closure policies discussed in this paper.

Figure 2: Performance of Static Page Policies for standard workloads.

Motivated by observations of this kind, hybrid-page closure policies emerged. This type of page policy uses various prediction algorithms to switch dynamically between open-page and close-page according to the application access behavior and thereby improve performance. Prediction accuracy and scalability with increasing memory size are the two main constraints when designing such page closure predictors. For most techniques in the literature, there is a linear relationship between the DRAM memory size and the resources required to monitor the memory access pattern. Thus, as the memory size grows, the cost of the page closure policy predictor grows and, as a result, the complexity and area of the on-chip memory controller increase.

Overall, the scalability of emerging memory systems such as HMCs, together with the growing interest in using large amounts of DRAM instead of disk storage in servers and database analytic applications (such as the 64 TB of DRAM in RAMCloud and the 150 TB of DRAM used by Facebook for memcache), demands a scalable approach for efficient design. Section 3 introduces HAPPY as a compact encoding scheme to address the scalability problem of page closure policy prediction techniques for DRAM memory systems.

3. HAPPY: HYBRID ADDRESS-BASED PAGE
POLICY

3.1 HAPPY - Basic Principles 

HAPPY is a compact memory address-based encoding built on the observation that there is a strong correlation between physical address bits and the internal structure of DRAM. One of the first steps in accessing the DRAM structure is the address mapping process. During this process, the physical address bits provided by a core are translated to the corresponding channel, rank, bank, row and column of a DRAM device using a fixed and pre-defined address-mapping algorithm. Having a fixed translation mapping creates a strong correlation between physical address bits and the DRAM structure. This means that, if some useful information can be extracted from the physical address bits after translation, it is possible to extract the same information before this stage.

All the page closure algorithms proposed so far focus on monitoring the memory access behavior after the translation phase. In general, these techniques use different performance counters on a per-channel, per-bank or per-row basis to monitor page hits and conflicts, the time that a row could be kept open, etc. HAPPY proposes a novel binary-encoding scheme with performance counters that store page closure history directly from the physical address bits. In other words, HAPPY introduces one predictor per physical address bit to forecast the page closure policy of each row in the memory system according to the run-time memory accesses.

Sections 3.2 and 3.3 show how HAPPY can be applied to the two main page closure categories: access-based and time-based techniques. We illustrate one traditional and one state-of-the-art technique to demonstrate how the HAPPY encoding can be applied to different systems with different implementation characteristics. The methodology proposed in this paper can be applied to other aspects of a DRAM structure as well.

3.2 HAPPY - Access-based Prediction 

To demonstrate how HAPPY can be applied to access-based algorithms we select the traditional hybrid-page policy algorithm. This employs simple saturating counters to monitor the memory access pattern and dynamically switch between the open-page and close-page policy at run time. Figure 3 depicts the basic structure of such a page closure policy predictor.

In this technique, one saturating counter (e.g. a 2-bit counter initialized to zero, i.e. the open-page policy) is assigned to each row of a DRAM bank. Every time a row-miss happens the corresponding counter is incremented. Whenever a row-hit occurs the counter is decremented. For each memory request, the accessed row's counter value determines the page closure policy: if the value is 0 or 1 the open-page policy is predicted, and if the value is 2 or 3 the close-page policy is predicted. However, having a counter for each row in a DRAM device imposes a high area and power overhead on the memory system. For example, a 4 GB DRAM memory system with 1 channel, 2 ranks, 8 banks and 32,768 rows per bank requires 1 x 2 x 8 x 32,768 = 524,288 counters, which is not scalable.

Figure 3: Basic structure of Hybrid Page Policy.
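A minimal sketch of this per-row scheme, and of its counter budget, is given below. The class name and the sparse dictionary storage are illustrative choices and not the hardware organisation; a real controller would need one physical counter per row.

```python
class HybridRowPredictor:
    """Traditional hybrid policy: one saturating counter per DRAM row."""

    def __init__(self, channels: int, ranks: int, banks: int,
                 rows_per_bank: int, counter_bits: int = 2):
        self.shape = (channels, ranks, banks, rows_per_bank)
        self.max_value = (1 << counter_bits) - 1
        # Sparse software model; every absent entry holds the initial value 0.
        self.counters = {}

    def num_counters(self) -> int:
        """Counters a hardware implementation would need (one per row)."""
        channels, ranks, banks, rows = self.shape
        return channels * ranks * banks * rows

    def update(self, row_id, row_hit: bool) -> None:
        """Decrement on a row-hit, increment on a row-miss, saturating at both ends."""
        value = self.counters.get(row_id, 0)
        value = max(value - 1, 0) if row_hit else min(value + 1, self.max_value)
        self.counters[row_id] = value

    def predict(self, row_id) -> str:
        """Lower half of the counter range predicts open-page, upper half close-page."""
        return "close" if self.counters.get(row_id, 0) > self.max_value // 2 else "open"
```

For the 4 GB configuration in the text, `HybridRowPredictor(1, 2, 8, 32768).num_counters()` evaluates to 524,288.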

Figure 4 depicts the HAPPY implementation of a Hybrid page policy. The binary representation of the requested physical address is a pattern of zeros and ones. HAPPY dedicates two encoding counters to each physical address bit position: one counter monitors the position when its value is '1' and the other monitors it when its value is '0'. Training of these counters is similar to the original implementation of Hybrid: for every page conflict, the counter selected by each physical address bit is incremented, and for every page hit the same counter is decremented. Since page-hits and page-misses occur on a per-row basis, there is no need to monitor all the available physical address bits. Thus, the physical address bits that correspond to the column and cache-line offset are not used. This further reduces the implementation cost of HAPPY.

Figure 4: HAPPY implementation of Hybrid Page Policy. (Example physical address: 0xC66EB.)

Having done this, each physical address bit correlates with the likelihood of a page hit or conflict, depending on the value of that bit. Therefore, for a given physical address, the likelihood of a page hit or conflict can be calculated simply by considering the counter values of all the participating bits in the requested address and using one of the following techniques: Majority vote or Aggregation.

Majority vote: Figure 5 explains this scheme using a simple example. Each physical address bit has a counter with its own standalone vote for an open-page or close-page policy for the requested physical address. The vote of each bit is given by the most significant bit of its saturating counter: if this bit is '0' the vote is for the open-page policy, and if it is '1' the vote is for the close-page policy. As the name implies, the final decision is determined by the majority vote across all the counters.

Figure 5: Example of HAPPY Majority vote decision. (Physical address 0xC66EB; Close Page Votes: 9; Open Page Votes: 10; Final Decision: Open Page.)

Aggregation: the final page closure policy decision can also be determined by comparing the sum of all the counter values, Equation (1), against a certain threshold, Equation (2).

$$\sum \mathrm{AddressBitCounters} < \mathrm{Threshold} \;\rightarrow\; \text{Open Page}, \qquad \text{else} \;\rightarrow\; \text{Close Page} \tag{1}$$

$$\mathrm{Threshold} = \frac{\mathrm{AddrBitsWidth} \times \mathrm{CounterValue}}{\ldots} \tag{2}$$

The experiments show similar results using either majority vote or aggregation. Thus, only the majority vote decision scheme is used in the final experimental results.

3.3 HAPPY - Time-based Prediction

As a case study to show how HAPPY can be applied to a time-based prediction algorithm, we chose the Intel-adaptive open-page policy employed by the Intel Xeon X5650. The basic structure of such a page closure policy is presented in Figure 6.

Figure 6: Intel-adaptive page policy predictor basic
structure.

The integrated memory controller used in this Intel processor can be configured at boot time to employ one of the following three page closure policy schemes: close-page, fixed-open and Intel-adaptive open-page. The fixed-open page policy keeps a row open for a fixed period of time and closes it after that. The Intel-adaptive scheme is an advanced version of the fixed-open scheme. As in the fixed-open policy, each row buffer within a bank has a Timeout Counter (TC) and a Timeout Register (TR). A row is kept open until the TC reaches the TR value and is then closed. However, the initial TR might not be a suitable value for all the benchmarks. Thus, the Intel-adaptive scheme provides a technique to update the TR at run time using a d-bit Mistake Counter (MC). Whenever a page conflict happens that could have been a page-empty, because there was enough time to precharge the last accessed row, the MC is decremented. Whenever a page-empty could have been a page-hit, because the row being accessed is the same as the last accessed row in that bank, the MC is incremented. After a specific time interval the MC is checked against predefined low and high thresholds to see whether a less or more aggressive page closure policy is required. If the MC is higher than the high threshold, the TR is incremented to keep the accessed row open for a longer period, and if the MC is lower than the low threshold, the TR is decremented to close the accessed row sooner.
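The TC/TR/MC interaction described above can be sketched per bank as follows. This is a hedged illustration, not Intel's implementation: the class name, the 4-bit MC width, the threshold values, the TR bounds, the unit TR step and the reset of the MC to its midpoint after each interval are all assumptions introduced here.

```python
class AdaptiveTimeoutBank:
    """Per-bank adaptive timeout in the style of the Intel-adaptive policy."""

    def __init__(self, tr: int = 16, tr_max: int = 63,
                 mc_bits: int = 4, low: int = 4, high: int = 12):
        self.tr = tr                          # Timeout Register, in cycles
        self.tr_max = tr_max                  # assumed upper bound for TR
        self.mc_max = (1 << mc_bits) - 1      # assumed MC width
        self.mc_mid = (self.mc_max + 1) // 2
        self.mc = self.mc_mid                 # Mistake Counter starts mid-range (assumed)
        self.low, self.high = low, high       # assumed thresholds

    def should_close(self, idle_cycles: int) -> bool:
        """The row is closed once the Timeout Counter reaches TR."""
        return idle_cycles >= self.tr

    def record_avoidable_conflict(self) -> None:
        """Page conflict that could have been a page-empty: row was open too long."""
        self.mc = max(self.mc - 1, 0)

    def record_missed_hit(self) -> None:
        """Page-empty that could have been a page-hit: row was closed too early."""
        self.mc = min(self.mc + 1, self.mc_max)

    def end_of_interval(self) -> None:
        """Compare MC with the thresholds and adjust TR by one step."""
        if self.mc > self.high:
            self.tr = min(self.tr + 1, self.tr_max)   # keep rows open longer
        elif self.mc < self.low:
            self.tr = max(self.tr - 1, 0)             # close rows sooner
        self.mc = self.mc_mid                         # assumed reset between intervals
```

In this sketch a run of early closures pushes the MC above the high threshold and lengthens the timeout, while a run of avoidable conflicts shortens it again.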

Figure 7 depicts the HAPPY implementation of the Intel-adaptive open-page policy. This time the aim is to extract, from the physical address bits, the timeout value for which each row should be kept open. We use the same methodology explained in the previous section; the only difference is that, instead of simple saturating counters, a Monitor Unit is dedicated to each physical address bit location. Each Monitor Unit includes an MC and a TR with the same function as in the original implementation of the Intel-adaptive page policy. A global timeout counter is still required to keep track of row closing and opening times on a per-bank basis.

Figure 7: HAPPY implementation of Intel-adaptive page policy predictor. (Example physical address: 0xC66EB.)

Updating the MCs is as before, but this time it is applied on a per-physical-address-bit basis rather than per bank. Moreover, instead of having a global timeout register per bank, the time period that a row can be kept open is calculated from the aggregation of all the participating bits of the accessed physical address.

To make a fair comparison between HAPPY and the original implementation of this page policy, the size of the MC is chosen as before (e.g. d-bit) and the size of the TR in each physical bit location is chosen small enough that the maximum time that a row can be kept open is equal to that of having one global TR per bank.

3.4 HAPPY - Intuition

• Observation: The main intuition behind HAPPY is based on the observation that addresses that are spatially close together tend to have a similar page-closure policy preference. HAPPY is devised to exploit this behavior by fine-grain monitoring of the behavior of physical address bits.

• Ensemble Methods: Although HAPPY was developed by analysis and observation of the experiments, we believe that there are Machine Learning principles that can justify the intuition behind HAPPY. A mathematical/theoretical framework that can explain HAPPY is that of Ensemble Methods. The family of algorithms categorized as Ensemble Methods combines multiple (normally simple) predictors. The theory explains how, by combining such predictors, a much improved predictor can be obtained, provided that certain diversity properties among the predictors are met. Random forests and neural networks are examples of very successful prediction algorithms that belong to the ensemble family. This paper addresses an online learning scenario and uses a fixed number of predictors with a non-linear combination function. When applying our technique to Intel-adaptive, we generate a pair of simple predictors per physical address bit and solve a regression problem. Each pair of predictors is trained using a different single physical bit (different features) and each member of the pair is trained using only Zero or One occurrences (different dataset); both are mechanisms that can improve diversity. In HAPPY, we use two counters per physical address bit; e.g. 4 GB is represented with 38 counters. If we were to limit ourselves to only a binary decision, then the maximum number of possible decisions that can be stored would be 2 to the power of 38. If we increase to 8 GB, then we use 40 counters, and thus the maximum number of possible decisions that can be stored would be 2 to the power of 40. Thus, as we increase the size of memory, we also increase the number of representable decisions.
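The per-bit counters and majority vote of Hybrid-HAPPY (Section 3.2) can be sketched as a small ensemble of predictors. The sketch below is illustrative only: the class name is hypothetical, the 13 offset bits assume 128 cache lines of 64 bytes per row, the 19 monitored bits correspond to the 38 counters quoted for 4 GB, and resolving a tie in favour of open-page is an assumption.

```python
class HybridHappy:
    """Two saturating counters per monitored physical address bit."""

    def __init__(self, addr_bits: int = 19, offset_bits: int = 13,
                 counter_bits: int = 2):
        self.addr_bits = addr_bits        # monitored bits above the row offset
        self.offset_bits = offset_bits    # column and cache-line offset bits (assumed 13)
        self.max_value = (1 << counter_bits) - 1
        self.msb_threshold = (self.max_value + 1) // 2
        # counters[i][v] tracks bit position i when that bit has value v (0 or 1).
        self.counters = [[0, 0] for _ in range(addr_bits)]

    def num_counters(self) -> int:
        return 2 * self.addr_bits

    def _bits(self, phys_addr: int):
        upper = phys_addr >> self.offset_bits
        return [(upper >> i) & 1 for i in range(self.addr_bits)]

    def train(self, phys_addr: int, page_hit: bool) -> None:
        """Decrement the selected counters on a page hit, increment on a conflict."""
        for i, v in enumerate(self._bits(phys_addr)):
            c = self.counters[i][v]
            self.counters[i][v] = max(c - 1, 0) if page_hit else min(c + 1, self.max_value)

    def predict(self, phys_addr: int) -> str:
        """Majority vote over the most significant bit of each selected counter."""
        close_votes = sum(
            self.counters[i][v] >= self.msb_threshold
            for i, v in enumerate(self._bits(phys_addr))
        )
        open_votes = self.addr_bits - close_votes
        return "close" if close_votes > open_votes else "open"
```

Because a neighbouring address shares most of its upper bits with a trained address, it selects mostly the same counters and inherits the same vote, which is one way to read the spatial-locality observation above.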


4. EVALUATION METHODOLOGY 

Page closure prediction algorithms for DRAM are sensitive to the application memory access patterns. To address this, we carry out an extensive evaluation, described as follows:

Simulator: USIMM, a detailed memory system simulator, is used as our main simulation platform. We extended USIMM to support five existing page closure policies (i.e. Open-Page, Close-Page, Hybrid, Fixed-Open and Intel-adaptive open-page) plus the two implementations of HAPPY described in Section 3 (i.e. Hybrid-HAPPY and Intel-adaptive-HAPPY). The scheduling algorithm in the memory controller is FR-FCFS (first ready, first come first serve). We evaluated HAPPY with different memory configurations: 2 GB for single-thread and 4 GB for multi-thread workloads. To increase the memory congestion we configured USIMM with 1 channel and 1 rank. The baseline USIMM system configuration parameters are listed in Table 2.


Processor
  Clock Speed            3.2 GHz
  Pipeline depth         10
  ROB size               32

DRAM Parameters
  DRAM Size              2 - 4 GB
  Bus Speed              800 MHz
  Configuration          1 Channel, 1 Rank, 8 Banks
  Rows per bank          65,536
  Cache lines per row    128

Table 2: Simulation parameters.


Address Mapping Schemes: The number of page conflicts in DRAM and, as a result, the memory performance are sensitive to the memory address mapping scheme. The experiments consider 3 different address mappings, presented in Figure 8. The first, Mapping 1, is a standard mapping that maximizes row buffer locality. The next two address interleaving policies are state-of-the-art schemes proposed by Kaseridis et al. [12] and Zhang et al. [13]. The mapping proposed by Zhang et al. [13] XORs part of the row address bits with the bank address bits to produce a new bank index (see Figure 8b). Kaseridis et al. [12] extend this technique by producing the column index from a different section of the physical address bits (Figure 8c). Both techniques aim to reduce page conflicts in DRAM. Our experiments show that the minimalist open-page policy (Mapping 3) performs better for most of the workloads. Therefore, this address mapping scheme is employed in all the experiments. Focusing on the best page closure policy (i.e. Intel-adaptive-HAPPY), we also report the sensitivity of HAPPY to all three address mapping schemes.
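The XOR step of the permutation-based mapping can be illustrated with a short sketch. The function name is hypothetical, and using the low-order row bits for the XOR is an assumption made here for illustration; the exact bit selection in [13] may differ.

```python
def permuted_bank(row: int, bank: int, bank_bits: int = 3) -> int:
    """Return a new bank index by XORing the bank bits with low-order row bits.

    Assumption: the lowest `bank_bits` bits of the row index are used.
    """
    mask = (1 << bank_bits) - 1
    return (bank ^ (row & mask)) & mask
```

With 8 banks, consecutive rows that would all map to bank 0 under a plain mapping are spread over all 8 banks, while for any fixed row the mapping between old and new bank indices remains one-to-one.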

Workloads: The workloads include a wide range of memory intensive applications (i.e. 48 workloads) from different benchmark suites (PARSEC, SPEC, BIOBENCH, HPC and COMMERCIAL) and representative regions of interest for each application. Table 3 lists the workloads and their corresponding benchmark suites.

Figure 8: Different Address Interleaving schemes. (a) Mapping 1: Maximise row-buffer locality. (b) Mapping 2: Permutation-based Interleaving [13]. Address fields shown in the panels: CH, RA, Bank, Row, Column, Block Offset.

The USIMM simulator can run arbitrary multi-application workloads using multiple traces. To increase the variety of memory access patterns, we set up USIMM for multi-application runs to produce 22 random workload mixes: a combination of 4-thread, 8-thread and 16-thread applications. Table 4 lists these multi-core workloads using the prefixes of the single-thread workloads presented in Table 3. Overall, the experiments consider 70 workload mixes.

Memory Footprint (MF): The performance of page closure predictors has to be evaluated carefully, otherwise the performance and accuracy numbers might be misleading. For instance, if an application targets a very small portion of memory, it might be possible to predict its behavior using a very small number of performance counters, whereas if the application accesses the whole memory space, it might be more difficult to keep track of its access pattern with only a few counters (e.g. HAPPY). To have a fair evaluation methodology we made sure that the memory traces cover a wide range of access patterns. To this end, we monitored the total physical pages accessed (Memory Footprint) per application at run time, and we can confirm that our single-thread applications have an average MF of 30% (up to 97%), our 4-thread workloads have an average MF of 50% (up to 75%), our 8-thread workloads have an average MF of 70% (up to 85%) and our 16-thread workloads have an average MF of 95% (up to 99.8%).
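One way such a footprint figure could be computed from an address trace is sketched below. The function name, the 4 KB page size and the definition of MF as touched pages divided by the pages in the configured memory are assumptions for illustration, since the text does not give the exact formula.

```python
def memory_footprint(addresses, memory_bytes: int, page_bytes: int = 4096) -> float:
    """Fraction of physical pages touched at least once by the trace.

    Assumptions: 4 KB pages, and MF = distinct touched pages / total pages.
    """
    touched_pages = {addr // page_bytes for addr in addresses}
    total_pages = memory_bytes // page_bytes
    return len(touched_pages) / total_pages
```

For example, a trace that touches 3 distinct pages of a 16-page memory has a footprint of 3/16, regardless of how many times each page is accessed.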

5. RESULTS AND DISCUSSIONS 

This section compares the different page closure policy prediction schemes with their HAPPY counterparts in terms of execution time, accuracy and scalability. Before presenting the result graphs, the following summary might be helpful:


• The HAPPY implementation of the Hybrid page policy is called Hybrid-HAPPY for brevity.

• The HAPPY implementation of the Intel-adaptive open-page policy is called Intel-adaptive-HAPPY for brevity.

• The results in Figure 10 and Figures 11-14 are normalized to 'static profiling'; the lower the bar, the better the performance.

Benchmark Suites

SPEC:        (a) GemsFDTD r, (b) bzip2 l, (c) cactusADM b, (d) gcc 2, (e) gcc cp, (f) gcc sc, (g) milc s, (h) soplex r, (i) xalancbmk r, (j) libquantum, (k) astar B, (l) bzip2 t, (m) gcc l, (n) gcc c, (o) gcc g, (p) mcf r, (q) omnetpp o, (r) sphinx3 a, (s) zeusmp z, (t) leslie
PARSEC:      (u) canneal, (v) streamcluster, (w) blackscholes, (x) facesim, (y) ferret, (z) fluidanimate, (A) freqmine, (B) swaptions
COMMERCIAL:  (D1) comm1, (D2) comm2, (D3) comm3, (D4) comm4, (D5) comm5
BIOBENCH:    (E) mummer, (F) tigr

Table 3: Standard workloads and benchmark suites.

5.1 Prediction Accuracy 

Understanding the prediction accuracy of the different types of page closure predictors has its pitfalls. For instance, prediction accuracy in the case of Hybrid predictors is straightforward, as the prediction outcome is either opening or closing a page (binary classification). However, in the case of the Intel-adaptive technique the accuracy needs to be described in terms of the timeout value (regression). Consider a scenario where a page has to be open for 40 clock cycles to get a page hit and Intel-adaptive predicts 39 clock cycles. In this case the prediction accuracy should be calculated differently.

To have a fair evaluation across predictors of a different nature, we calculate the prediction accuracy based on the Page-Hit and the Page-Miss prediction outcome. The main purpose of using page policy predictors is, after all, to increase the page hits and reduce the page misses in the DRAM. Thus, we calculate the Oracle page-hits (the maximum possible page-hits with a perfect predictor) and the Oracle page-misses (the minimum possible page-misses with a perfect predictor) and evaluate the actual page-hits and page-misses that occurred during the execution of each workload against the oracle numbers. Figure 9 presents the prediction accuracy (GMEAN) of the different predictors across all the workloads evaluated in this work. The open-page and close-page policies deliver the maximum prediction accuracy for page-hits and page-misses, respectively. This happens because an open-page policy leaves all the pages open and can therefore cover all the possible page-hits in the system but none of the page-misses. A close-page policy behaves similarly but in the opposite scenario. The hybrid-page policy delivers a moderate page-miss and page-hit prediction accuracy (around 60%). Intel-adaptive and fixed-open deliver good prediction accuracy for both page-hits (80% and 75.8%, respectively) and page-misses (83.5% and 90.4%, respectively). Overall, the HAPPY implementations of both Intel-adaptive and Hybrid are slightly more accurate than the original implementations. These prediction accuracy numbers are consistent with the execution times presented in Section 5.2. Also, from the accuracy results it can be concluded that the page-hit prediction accuracy has a higher impact on the overall execution time than the page-miss prediction accuracy.


Figure 9: Prediction accuracy for different predictors. (Series: Page-Hit, Page-Miss.)


5.2 Performance Analysis 


Figure 10 summarizes the performance of the different
prediction algorithms, normalized to ‘Static Profiling’, for
all the benchmarks. Each bar in this figure represents the
geometric mean (GMEAN) of the execution time over the
workloads in each category. The detailed performance of the
prediction algorithms for individual workloads is presented
in Figures 11-13. These figures again confirm that a static
page closure policy cannot deliver the optimum execution time
for all the workloads. The workloads from the HPC and SPEC
benchmarks mostly prefer the open-page policy. On the other
hand, the PARSEC, BIOBENCH and Commercial workloads mostly
prefer the close-page policy.
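The normalization and averaging behind these bar graphs can be sketched
as below. The function name `normalized_gmean` is hypothetical, and the
sketch assumes that each workload's execution time is divided by the
execution time of the same workload under the baseline policy, so that
values below 1.0 mean faster than the baseline.

```python
from statistics import geometric_mean
from typing import Sequence


def normalized_gmean(exec_times: Sequence[float],
                     baseline_times: Sequence[float]) -> float:
    """Geometric mean of per-workload execution times normalized to a baseline.

    Both sequences must list the same workloads in the same order.
    """
    if len(exec_times) != len(baseline_times):
        raise ValueError("one baseline time is needed per workload")
    ratios = [t / b for t, b in zip(exec_times, baseline_times)]
    return geometric_mean(ratios)
```

The geometric mean is the usual choice for normalized ratios because a
2x slowdown on one workload and a 2x speedup on another cancel out.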

Overview: Our experimental results show that the best page
closure prediction scheme (i.e. Intel-adaptive-HAPPY) delivers
5% and 8% better average performance across all the workloads
(up to 12% and 20%) in comparison with the open-page and
close-page policies, respectively. Overall, the HAPPY
implementations of both Hybrid and Intel-adaptive achieve
similar performance to the original implementations of these
page closure policies, albeit with a much lower hardware
overhead. Comparing Intel-adaptive with Intel-adaptive-HAPPY
shows that the HAPPY implementation can reduce the cost of
implementation by 5x for the evaluated 64 GB memory size (up
to 40x for a memory size of 512 GB) while improving the
performance by up to 2%. Similar behavior can be observed for
Hybrid and Hybrid-HAPPY: Hybrid-HAPPY shows a 182,000x
reduction in the cost of implementation for the evaluated
64 GB memory size (up to 1.2Mx for a memory size of 512 GB)
while improving the performance by up to 3% for some
workloads.

Figure 10: Average relative performance to static profiling for all the single-thread workloads.

Figure 11: Relative performance to static profiling for HPC workloads.

Figure 12: Relative performance to static profiling for SPEC workloads.

Figure 13: Relative performance to static profiling for PARSEC, BIOBENCH and Commercial workloads.

Figure 14: Relative performance to static profiling for Multi-Core workloads.

Similarly, as Figure 14 presents, for multi-thread
applications, even with a very high MF, the performance of
HAPPY is consistent and similar to that of the original
implementation. The experimental results show that
Intel-adaptive-HAPPY delivers 5% and 14% better average
performance across all the workloads (up to 9% and 22%) in
comparison with the open-page and close-page policies,
respectively.

Sensitivity to Address Mapping Schemes: To investigate the
sensitivity of HAPPY to different address mappings, we select
the best page closure policy (i.e. Intel-adaptive-HAPPY) among
all the predictors presented in this paper and evaluate it
with the three address mappings introduced earlier in the
paper. Figures 15 and 16 illustrate the prediction accuracy of
Intel-adaptive (original and HAPPY implementations) using the
different mapping schemes. These results show that the HAPPY
implementation of Intel-adaptive always delivers identical or
slightly better results than the original implementation, no
matter which address mapping is used.


Figure 15: Page-hit prediction accuracy with different address mappings. Legend: Intel-adaptive, Intel-adaptive-HAPPY.


5.3 Sensitivity to Memory Size 

We have evaluated HAPPY for DRAM sizes of up to 64 GB and the
results show that HAPPY behaves consistently as the memory
size increases. Our experimental results suggest that the
factor that affects the performance of HAPPY is the
utilization of the memory address space rather than the size
of the memory. For this reason, we considered a 4 GB memory
organization with up to 99.8% memory space utilization for our
multithread experiments (results are presented in Figure 14).
Even in this situation, the results show that HAPPY delivers
competitive performance against the original implementations
of both the Hybrid and Intel-adaptive page policies while
significantly reducing the hardware overhead.

Figure 16: Page-miss prediction accuracy with different address mappings. Panels: Mapping 1, Mapping 2, Mapping 3.

5.4 Scalability with Memory Size 

Figure 17 depicts the required storage (bytes) for each
prediction algorithm for different sizes of memory. The HAPPY
implementation of the hybrid prediction technique is orders of
magnitude (e.g. up to 1.2Mx) cheaper than the original
implementation, while delivering similar performance. In the
case of the Intel-adaptive page closure policy, the HAPPY
implementation requires slightly more resources than the
original implementation for memory sizes of less than 8 GB.
However, as the memory size grows, Intel-adaptive-HAPPY scales
better than the original implementation, by up to 40x for a
memory size of 512 GB. Table 5 lists the required performance
counters for the different page closure policies with and
without HAPPY, considering a memory system with X channels,
Y ranks, Z banks and W rows.


Figure 17: Scalability of different page closure prediction algorithms. Legend: Hybrid, Hybrid-HAPPY, Intel-adaptive, Intel-adaptive-HAPPY.


Implementation    Required Counters
--------------    -----------------------------------------
Hybrid            X × Y × Z × W
Hybrid-HAPPY      (log2 X + log2 Y + log2 Z + log2 W) × 2
Intel-adaptive    (X × Y × Z) × 2
Intel-HAPPY       (log2 X + log2 Y + log2 Z + log2 W) × 4

Table 5: Required performance counters for different
page closure policies.
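As a worked example of the expressions in Table 5, the following sketch
evaluates the four counter counts for a given memory organization. The
function name `required_counters` and the sample geometry are
hypothetical, and the sketch assumes that X, Y, Z and W are powers of
two so that every logarithm is a whole number of address bits. It counts
counters only; it does not model the counter widths behind the storage
figures reported above.

```python
def required_counters(channels: int, ranks: int, banks: int, rows: int) -> dict:
    """Evaluate the Table 5 expressions for X channels, Y ranks, Z banks, W rows."""
    dims = (channels, ranks, banks, rows)
    for n in dims:
        if n < 1 or n & (n - 1):
            raise ValueError("each dimension must be a power of two (assumption)")
    # log2 of a power of two equals bit_length() - 1, computed exactly.
    address_bits = sum(n.bit_length() - 1 for n in dims)
    return {
        "Hybrid": channels * ranks * banks * rows,
        "Hybrid-HAPPY": address_bits * 2,
        "Intel-adaptive": channels * ranks * banks * 2,
        "Intel-HAPPY": address_bits * 4,
    }
```

For a hypothetical organization with 2 channels, 2 ranks, 8 banks and
65,536 rows, this gives 2,097,152 counters for Hybrid against 42 for
Hybrid-HAPPY, and 64 for Intel-adaptive against 84 for Intel-HAPPY.
Doubling the number of rows doubles the Hybrid count but adds only 2 and
4 counters to the two HAPPY variants.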
Multi-Core Workloads

Mix      Workloads
-----    ---------------------------------------------------
MIX1     (W-D3-D3-F)
MIX2     (D1-D5-E-F)
MIX3     (D1-x-y-F)
MIX4     (D2-D4-A-F)
MIX5     (D2-D4-j-E)
MIX6     (D2-y-t-F)
MIX7     (D4-D5-g-F)
MIX8     (D4-X-X-F)
MIX9     (x-y-z-F)
MIX10    (C21-C22-C4-b)
MIX11    (C5-C6-u-e)
MIX12    (b-n-s-w)
MIX13    (w-D1-D5-x-y-z-t-F)
MIX14    (D1-D4-D5-X-Z-X-B-F)
MIX15    (D1-D4-D5-j-E-D5-v-F)
MIX16    (D2-D4-D5-Z-A-A-D4-F)
MIX17    (C5-C6-u-l-e-o-p-h)
MIX18    (C13-C14-C17-C18-C21-C2-C4-v-k-l-c-m-e-n-h-s)
MIX19    (C13-C18-C21-C2-C6-u-v-C21-u-l-l-o-t-p-h-s)
MIX20    (C14-C17-C21-C22-C2-C4-C5-C8-C14-C21-C4-k-e-o-a-p)
MIX21    (C17-C21-u-C17-q-q-i-t-o-b-o-a-t-p-q-i)
MIX22    (C18-C22-C5-C6-C8-u-k-l-d-e-n-o-p-q-h-r)

Table 4: Randomly generated Multi-Core workloads.


5.5 Prediction Algorithms - Weakness & Strength 

Due to the nature of implementation and structure of 
each predictor, each of them might or might not work 
in a specific situation. Here, we discuss such situations. 

Static Policies: the open-page policy works best for 
high locality workloads but degrades the performance 
of DRAMs significantly for the workloads with highly 
random or dynamic memory accesses. The close-page 
policy has the completely opposite behavior. PARSEC 
and SPEC workloads are good examples which show the 
different behavior of open-page and close-page policy. 

Fixed-Open: The performance of this type of algorithm is
fairly sensitive to its predefined timeout value. Following
the methodology presented in [12], in our experiments we set
this value equal to tRC, the minimum time between consecutive
accesses to different rows within a bank. This delay provides
enough opportunity to capture a possible page hit, as the
results presented in Figure 10 confirm. However, for
non-memory-intensive threads with high locality, or
memory-intensive threads with low locality (e.g. ‘mummer’ and
‘tigr’), this technique might not work well. For the first
category, the time interval between memory requests might be
longer than the fixed timeout value, which means that this
technique will close the row before a page hit happens. For
the second category, the time interval between memory requests
might be shorter than the fixed timeout value, which means
that a row is kept open unnecessarily, which most likely leads
to further page conflicts.
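The two failure cases of a fixed timeout can be illustrated with a
minimal sketch, assuming a single bank, a timer that restarts on every
access, and a hypothetical trace of (time, row) pairs. The function name
`classify_fixed_open` and the timeout of 39 cycles used in the usage
note are illustrative only and are not taken from the experiments.

```python
from typing import List, Sequence, Tuple


def classify_fixed_open(accesses: Sequence[Tuple[int, int]],
                        timeout: int) -> List[str]:
    """Classify each access to one bank under a fixed-timeout policy.

    accesses: (time, row) pairs sorted by time.
    Returns "hit", "empty" (bank precharged) or "conflict" per access.
    """
    outcomes = []
    open_row = None   # row currently open in the bank, if any
    last_time = None  # time of the previous access to the bank
    for time, row in accesses:
        # The row is precharged once the timer expires without an access.
        if open_row is not None and time - last_time > timeout:
            open_row = None
        if open_row is None:
            outcomes.append("empty")
        elif open_row == row:
            outcomes.append("hit")
        else:
            outcomes.append("conflict")
        open_row, last_time = row, time
    return outcomes
```

With `timeout=39`, the trace `[(0, 5), (10, 5), (100, 5), (110, 7)]`
yields `["empty", "hit", "empty", "conflict"]`: the third access loses a
possible page hit because the gap exceeds the timeout, and the fourth
suffers a conflict because a different row arrives while the timer is
still running.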

Hybrid: the saturating counters employed in this category (in
either the original or the HAPPY implementation) are trained
by the number of page-hits and page-conflicts that they
observe. Therefore, the prediction accuracy of this type of
technique is fairly sensitive to the distribution of page
hits/conflicts within the DRAM. For instance, ‘streamcluster’
exhibits such behavior.
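The training behavior described above can be sketched with a single
saturating counter. The class name, the 2-bit width, the initial value
and the decision threshold are assumptions made for illustration and are
not claimed to match the evaluated Hybrid implementation.

```python
class SaturatingCounter:
    """A saturating counter trained by page-hits and page-conflicts."""

    def __init__(self, bits: int = 2) -> None:
        self.max_value = (1 << bits) - 1
        # Start just above the threshold, i.e. weakly predicting "open".
        self.value = (self.max_value + 1) // 2

    def update(self, page_hit: bool) -> None:
        """Count up on a page-hit and down on a page-conflict, saturating."""
        if page_hit:
            self.value = min(self.max_value, self.value + 1)
        else:
            self.value = max(0, self.value - 1)

    def predict_open(self) -> bool:
        """Predict open-page in the upper half of the counter range."""
        return self.value > self.max_value // 2
```

A long run of page-hits saturates the counter at its maximum, so a
single page-conflict does not flip the prediction, whereas two
consecutive conflicts do. An even mix of hits and conflicts keeps the
counter near the threshold, which is one way to see why the accuracy
depends on how hits and conflicts are distributed.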

Intel-adaptive: our experiments show that this prediction
algorithm is the best of all the schemes presented in this
paper. However, one weakness of this technique is the update
granularity of TR. In our experiments, every time a check of
the MC suggests a more or less aggressive page closure policy,
TR is incremented or decremented by one, respectively.
Updating by a single step (increment or decrement) allows fine
tuning of TR but reduces the training rate of the overall
prediction technique. This means that, for workloads whose
access pattern changes frequently between phases (e.g. between
high- and low-locality access patterns), the Intel-adaptive
scheme might not be able to deliver its best performance. Such
behavior can be observed in ‘canneal’ and ‘comm1’.
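The cost of a one-step update can be made concrete with a small sketch.
The function names, the register bounds and the `step` parameter are
hypothetical; each suggestion stands for the outcome of one MC check,
encoded as +1, -1 or 0, and no claim is made about how the actual
hardware derives it.

```python
from typing import Iterable, List


def adapt_tr(tr: int, suggestions: Iterable[int], step: int = 1,
             tr_min: int = 0, tr_max: int = 255) -> List[int]:
    """Apply a sequence of +1 / -1 / 0 suggestions to the timeout register.

    Returns the value of TR after each check, clamped to [tr_min, tr_max].
    """
    history = []
    for s in suggestions:
        tr = max(tr_min, min(tr_max, tr + s * step))
        history.append(tr)
    return history


def checks_to_reach(start: int, target: int, step: int = 1) -> int:
    """Consecutive same-direction checks needed to move TR from start to target."""
    return -(-abs(target - start) // step)  # ceiling division
```

Moving TR from 10 to 50 takes 40 consecutive checks with a step of one
but only 5 with a step of 8, at the price of coarser tuning. A workload
that switches phase before those checks complete never sees TR reach the
value suited to the new phase.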

HAPPY: so far we have only explained the advantages of HAPPY.
However, like all the other proposed techniques, HAPPY also
has weaknesses. Given the global nature of a HAPPY
implementation, it is expected that HAPPY cannot perform as
efficiently as fine-grain schemes for workloads with fairly
dynamic behavior that target a small part of the DRAM locally.
This can be seen in workloads like ‘tigr’.

5.6 Flexibility 

HAPPY is a proof of concept that the physical address bits can
be a source of useful information, which can be extracted
using the right encoding and decoding techniques. This makes
HAPPY a fairly flexible tool that can be applied to prediction
algorithms that have not been practical due to their cost of
implementation, making them feasible. In this paper we applied
HAPPY to two completely different prediction schemes and
showed how the performance and scalability of these schemes
improved. However, page closure policies are not the only
candidates that can take advantage of the knowledge presented
by HAPPY in this paper, and we will present more interesting
showcases in the near future.

6. RELATED WORK 

Succinctly, prior research in this area can be categorized
into two main groups: access-based and time-based techniques.

Access-based techniques are those that monitor and keep a
history of the row hits and row misses at different
granularities in DRAMs to make a prediction of the future page
closure policy for each row or bank within a DRAM memory
system. Xu et al. proposed a two-level dynamic SDRAM policy
predictor which collects the row hit/miss behavior for the
last n accesses in a history register. For each entry in the
history register, there is a 2-bit saturating counter that
keeps track of the page closure policy for each access. Huan
et al. proposed the Processor-Directed dynamic page policy,
where the processor keeps track of the last row access to each
bank to predict page hits or misses for future memory
requests. The processor sends this information to the memory
controller to specify the page closure for the next memory
access. Awasthi et al. keep track, in a history table, of the
number of accesses each row receives before it is closed. When
a row is opened, the number of expected accesses to that row
is looked up; if there is no recorded entry for the accessed
row in the history table, the row is kept open. However, if
there is an entry for the accessed row, it is closed after the
expected number of accesses suggested by the history table.


ing the optimum time that a row can be left open. Blackmore
[27] presented a quantitative analysis of page closure
predictors. This work focused specifically on the structure of
the Intel-adaptive page policy and tried to improve it by
introducing the concept of the inter-arrival distribution.
Stankovic et al. used the concepts of live-time and dead-time
to predict the page closure. The live-time is the interval
from the opening of a row until the last access to that row,
while the dead-time is the interval from the last access to an
open row until its closing. If the predictor predicts a zero
live-time, or predicts that the row has entered its dead-time
period, the row is closed immediately after the DRAM access;
otherwise it is kept open for future accesses. In another
work, Kaseridis et al. exploited the fact that in DRAMs there
is a minimum time of tRC between two activations within a
bank, and speculatively leave pages open for the tRC period.

To sum up, HAPPY is the only technique which considers an
encoding based on the memory address bits, offering a compact
means of storing history to inform the predictions. In
addition, we have shown how to apply HAPPY to time-based
techniques (i.e. Intel-adaptive-HAPPY) and access-based
techniques (i.e. Hybrid-HAPPY).


7. CONCLUSIONS 

DRAM performance depends on the memory access pattern and,
more specifically, on the number of page-hits and
page-conflicts that occur at run time. Modern DRAM controllers
employ advanced page closure policy predictors to improve
performance by trying to transform page-conflicts into
page-empty cases (e.g. by closing the last accessed row at the
“right time”), and page-empty cases into page-hits (e.g. by
keeping the last accessed row open for a longer time).
However, the main challenge is to balance the prediction
accuracy of these predictors with manageable hardware
overheads (scalability) as we increase the size of DRAM.


We have described HAPPY - a compact and efficient
binary-encoding technique - to alleviate the scalability
problem of DRAM page closure predictors. HAPPY relies on the
simple observation that, because the address mapping scheme is
fixed, there is a strong correlation between the physical
address bits of the memory addresses requested by processors
and the internal structure of the DRAM. Thus, the physical
address bits carry the information that a memory controller
needs to predict a page-hit or page-conflict for a particular
access. Considering this, the performance counters and
monitoring units needed by the page closure prediction
algorithms can be encoded from the physical address bits.
Doubling the size of DRAM only implies one extra physical
address bit. This means that with HAPPY only one extra
monitoring unit is required to predict the DRAM page closure
policy when the size of memory is doubled. In other words,
HAPPY offers the smallest hardware overhead to implement
dynamic DRAM page closure prediction algorithms.
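The idea of encoding the monitoring state from address bits can be
sketched as follows. This is an illustrative reading and not the
evaluated hardware design: it assumes two saturating counters per
address bit (one for each bit value, consistent with the ×2 factor
listed for Hybrid-HAPPY in Table 5) and a simple majority vote over the
counters selected by the bits of the requested address. The class name,
the counter range and the voting rule are assumptions.

```python
from typing import List


class AddressBitPredictor:
    """Per-address-bit saturating counters with a majority vote (illustrative)."""

    def __init__(self, num_bits: int, max_count: int = 3) -> None:
        self.num_bits = num_bits
        self.max_count = max_count
        self.threshold = (max_count + 1) // 2
        # counters[i][v] is trained by accesses whose address bit i equals v.
        self.counters = [[self.threshold, self.threshold] for _ in range(num_bits)]

    def _bits(self, addr: int) -> List[int]:
        return [(addr >> i) & 1 for i in range(self.num_bits)]

    def predict_open(self, addr: int) -> bool:
        """Keep the row open if at least half of the selected counters agree."""
        votes = sum(1 for i, v in enumerate(self._bits(addr))
                    if self.counters[i][v] >= self.threshold)
        return 2 * votes >= self.num_bits

    def train(self, addr: int, page_hit: bool) -> None:
        """Update the counter selected by each address bit."""
        for i, v in enumerate(self._bits(addr)):
            c = self.counters[i][v]
            self.counters[i][v] = (min(self.max_count, c + 1) if page_hit
                                   else max(0, c - 1))

    def num_counters(self) -> int:
        return 2 * self.num_bits
```

In this sketch, adding one address bit (doubling the addressable memory)
adds two counters, whereas a table with one counter per row would double
in size.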

We have evaluated HAPPY by applying it to a traditional Hybrid
page closure policy, as well as to the state-of-the-art
Intel-adaptive open-page policy included in the Intel Xeon
X5650. The experimental results show that the HAPPY
implementation of the Intel-adaptive page policy can reduce
the cost of implementation by 5x for the evaluated 64 GB
memory size (up to 40x for a memory size of 512 GB) while
maintaining the prediction accuracy. The other case study
shows a 182,000x reduction in the cost of implementation for
the evaluated 64 GB memory size (up to 1.2Mx for a memory size
of 512 GB) while maintaining the prediction accuracy. The
experiments have also reported the accuracy of the predictors
and studied their sensitivity to the memory address mapping.
In both scenarios, HAPPY maintains its key advantage of
offering no degradation of prediction accuracy while
significantly reducing the hardware overhead.